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Abstract 

The Minimum Description Length (MDL) principle states that the optimal model 
for a given data set is that which compresses it best. Due to practial limitations the 
model can be restricted to a class such as linear regression models, which we address 
in this study. As in other formulations such as the LASSO and forward step-wise 
regression we are interested in sparsifying the feature set while preserving generalization 
ability. We derive a well-principled set of codes for both parameters and error residuals 
along with smooth approximations to lengths of these codes as to allow gradient- 
descent optimization of description length, and go on to show that sparsification and 
feature selection using our approach is faster than the LASSO on several datasets 
from the UCI and StatLib repositories, with favorable generalization accuracy, while 
being fully automatic, requiring neither cross-validation nor tuning of regularization 
hyper-parameters, allowing even for a nonlinear expansion of the feature set followed 
by sparsification. 



1 Introduction 
1.1 MDL 

The Minimum Description Length (MDL) principle is one of many methods that have been 
proposed to tackle the general model selection problem through principled balancing of the 
cost of storing a given model's prediction errors on a given data set and that of storing the 
model's parameters. Based on algorithmic information theory, it was first formulated by 
Rissanen in 1978 [15]. Other methods include AIC pQ, BIC [IJ and MML [27], which are 
based on Baycsian statistics and classical information theory In the past decades several 
theoretical approaches to MDL have evolved [TU [T5J [55] , while practical implementations 
remain an open research topic [19j , as the potential computational burden MDL can impose 
is significant. 

Practical MDL uses description methods that are less expressive than universal languages, so 
that the length of the shortest description of the data is always computable [TJl [HJ [2TJ [J2] . 
It restricts the set of allowed codes in a manner which allows us to find the shortest code 
length of the data, relative to the set of allowed codes in finite time. So far, three practical 
MDL schemes have been devised [TT] . We used the simplest of them, which is called two-part 
MDL: M = argmin MeA/( (L (D\M) + L (M)), where D is the data, M the model, M the 
model class (the set of allowed models) and L the code length function. If we knew the 
probability distributions p(M),p(D\M), we could use maximum likelihood estimation (the 
problem being in essence a Bayesian learning formulation) or appropriate asymptotically 
compact codes, i.e. Shannon-Fano coding, to generate the code books for L(D\M) and 
L(M) (as it is done in MML). However, we do not know them, so in order to cover a wide 
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set of possible underlying noise probability densities and size of training data sets we used 
a novel approach based on codes we developed ourselves. 

1.2 Regularized linear regression vs. compressive linear regression 

In linear regression, we assume we have data organized in an observation matrix D £ R JxN 
where N is the number of observations and J is the number of features. This matrix may be 
expanded in two ways: by adding functions of groups of observations (therefore increasing 
N), as in kernel expansion, or adding functions of feature groups (feature expansion), thereby 
increasing J. In linear regression it is customary to add at least one constant row to the 
features, corresponding to the bias. Feature expansion can be more generally written as 

X G {{b q , f q : q = 1, 2, 3, ...,p} : b, C {1, 2, 3, J}, /, : Rl b *l -> R 1 } (1) 

Thus X, which we call the feature product, is a subset of matrix functions X C R' 7xJV — > 
R( J +p) xiV operating on groups of rows, as to allow eventual feature selection. The linear 
regression formulation can be written as 

y = X(D)0 + e,y 6 R N ,9 G R K , (2) 

where K = J+p. Depending on our prior knowledge of the probability density function p(e) 
(i.e. of measurement errors) there are various ways in which an approximate solution 9 may 
be sought, by minimizing the appropriate loss function (or log- likelihood). One of the most 
common "non-uniformative" assumptions, consistent with a maximal entropy distribution 
for finite variance i.i.d sampling, is that the errors are Gaussian and the corresponding loss 
function to be minimized is provided by the 2-norm 

= argmin(||y-X( J D)0|| 2 + 7 ||0y (3) 
e 

The additional r-norm is necessary, even without explicit regularization (7 = 1), to cover 
the underdetermined case (K > N), but without regularization £2 regression can very of- 
ten overfit [5] . Various regularization norms have been proposed, the most commonly used 
being ridge regression and LASSO [53]. The Tikhonov regularization [25] proposes a gen- 
eralized 2-norm, but commonly the straightforward £2 norm is used (ridge regression) . The 
LASSO proposes the £\ norm and offers the additional advantage of a tendency to sparsify 
9. Both approaches can be seen, from a Bayesian perspective, to assume a Gaussian error 
distribution p(e) and independent zero-mean priors with uniform variances over compo- 
nents of 9, of Gaussian form for ridge regression and double-exponential for the LASSO. 
The Bayesian interpretation is problematic however: each feature may have different mea- 
surement units (scale), so a likelihood interpretation based on the sum of absolute values 
of the coefficients (each having inverse units of the corresponding feature) is rather non- 
informative. Ridge-regression is likewise not scale invariant. Furthermore, the variance of 
the parameter "prior" depends on the hyper-parameter 7, which is almost always chosen via 
posterior analysis of the data. Although LASSO and ridge regression, for fixed 7, consist of 
convex optimization problems which can be accurately and efficiently solved, they are not 
convex in 7: complete model selection requires computationally expensive bootstrapping 
and cross-validation, while the actual statistic to be minimized depends on the particular 
implementation. 

Our objective was to provide for a scale invariant objective function which allows for mean- 
ingful Bayesian interpretation and does not require hyper-parameter tuning. The MDL for- 
mulation of regularized regression requires, first of all, the weak assumption that the target 
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has finite and uniform measurement precision or quantization width 5{y) (i.e. y/5{y) E Z ). 
We seek 

) + L (0 # )) , where % G Q K (4) 

The difference between equations ([3]) and ([4J is that we are optimizing over rational-valued 
vectors 6#, rather than real- valued ones, as both the parameter prior and the measurement 
error distributions are now discrete probabilities rather than probability density functions. 
Clearly, a direct optimization over rational numbers would be very difficult. This is precisely 
why we develop codes, approximations and bounds to make MDL optimization possible for 
the model class of linear regression models with nonlinear feature products. We shall call 
this method Compressive Linear Regression (CLR). 

2 Methods 

The basic approach we will follow is to compute, as precisely as possible, description lengths 
of regression parameters and residuals such that a smooth objective function which approxi- 
mates the description length of the data can be built. For this purpose, known problems such 
as compact codes for integers, rationals and random integer sequences must be revisited. 

2.1 Coding of integers 

Rissanen's universal prefix-free code (that is, no code word is a prefix of any other code word) 
for integers [17j provides a Bayesian "non-uniformative" prior 2~ i ^, a principle which can 
be extended from the integer set to rational numbers as well, which we will propose in 
this paper. A universal code is a prefix-free code which has an expected code length over 
all integers that is invariant, within a constant, no matter what the prior probability of 
the integers are (given that the latter monotonically decreases). However, Rissanen's code 
is not very compact as defined by the Kraft inequality: J2dev 2 _£ ^ < 1. A compact 
code is a code, for which the Kraft sum is close to one, which implies that only few of all 
possible code words of a certain length are not in the code. For every set of code word 
lengths that satisfy the inequality, there exists a unique decodable prefix-free code with 
the same code word lengths |9]. This is why we can restrict our attention to prefix- free 
codes throughout this paper. Another argument against using Rissanen's code is that the 
area under the curve of the code length function is not convex everywhere, being concave 
at x\ = 2,x 2 = 2 2 = 4, £3 = 2 4 = 16,2:4 = 2 16 ,..., which can cause problems for some 
optimization algorithms. 

Universal codes are the basic building block for the other codes described, therefore we 
chose to build a smooth, compact and convex universal code U n for non-negative integers 
(and consequently for the signed integers as well via the ordering 0,-1,1,-2,2,-3 etc. We 
call the code for the signed integers U). Our code uses two different coding schemes: For 
small numbers, it uses a coding scheme F similar to Fibonacci coding [5] and then switches 
at a certain point to a coding scheme E producing code words with the same lengths as 
Elias-Delta coding [7] for large numbers. In F the first two codewords are and 100. The 
following k codewords are produced by binary-adding the number 1. For k we use consecutive 
Fibonacci numbers, i.e. 1,1,2,3,5,... The fc-th codeword is additionally appended by "0". 
In E, the codeword for a number n consists of the three binaries t, blen and b* . t are 
len — 1 ones, len is the length of the binary representation b(n) of n, blen is the binary 
representation of b(n) with the first bit set to zero and b* is b{n) without the first bit. 
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Figure 1: Left: a code for rational numbers: Code word lengths increase with \9\ and 9/6. 
Right: Smooth approximation function for the a coding function. 



2.2 Coding of rationals or simplest rational within an interval on 
the real line 

Any universal integer code can be used to construct a prefix-free code for rational numbers 
of specified precision. However, there is no obvious ordering scheme for rationals in terms of 
code lengths. Li and Vitanyi offer a method to map rationals into the unit interval [0, 
using the idea of cylinder sets [13] . We implement a variant of this code and name it a-code: 

a : K 2 -» {0, 1}* , a _1 : {0, 1}* -» Q , s.t. 9 # = a' 1 (a(6, 5)) €(9-6,6- 6)) (5) 



where a~ 1 (a(9, 6)) is the rational number coded by the binary a(9,5). Using the code a, 
we can recover a rational number close to the real number 9 to a precision of at least 5. 
We define a — a(9, 6) = U(9s)U(9\ og ), where 9 is the rational number to be encoded, 6 the 
precision to which it should be encoded, 9s = \9/6] and 9\ og = |~log|0|] — [0.5 log n]. The 
decimal value of the code word a(0,S) is 0# = 9g ■ 2^ loges ^ +eios . As a is prefix-free, the 
code for a vector of rationals which are not presumed to be interdependent is simply the 
concatenation of the codes of its elements: a(0, S) = ^ - a(9i, Si). We shall use a to store 
the linear regression parameters, as detailed below. Through numerical tests, we were able 
to find a smooth function a that approximates our a-code lengths, for which we calculated 
an approximation mean error of 0.8 bit by sampling 10 5 evenly spaced points for 9 € [2 ; 2 ] 
and a relative precision 9/8 £ [2~ 8 ;2 ]: 
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-(9) = d- (log(log(0 2 +c 2 ))) -1 



In this formula, erf is the error function and cq, c\ and c 2 are numerically fitted constants 
of 0(1). The alpha code lengths and their smooth approximations are shown in Figure [TJ 



2.3 Spherical coding of a random sequence of signed integers 

We seek a means of coding the part of the data that cannot be explained by the model, 
i.e. the residuals e (a sequence of signed integers). Since only model fitting (compression) 
can determine that the sequence is not random, we shall build a code assuming that each 
element is random (incompressible), independent of each other, and uniform (equivalent to 
an i.i.d sample). 
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One way we could achieve this is simply to assign a uniform length signed integer code to all 
possible values ranging from the observed maximum to the observed minimum. This would 
be equivalent to assuming a uniform error distribution, which would be inefficient as it is 
not likely in practice. Another possibility would be to assign a universal code to each signed 
integer, which would be also inefficient, since we would not be exploiting the (reasonable) 
assumption of uniformity. We chose, therefore, to assume that the likelihood of a sample e 
decreases with its 2-norm (hence the name "spherical code"), and to use a universal code 
for coding of each possible ./V-dimensional value in order of increasing 2-norm. This, we 
will show numerically at least, is equivalent to assuming a Gaussian (normal) distribution 
of noise. 

The coding problem, i.e. a scheme which allows the counting of errors of any size, becomes 
one of counting the maximal expected number of hypercubes with side lengths S(X) that 
intersect a hypersphere of N dimensions ordered spirally outwards from the origin. 

H : N Z N , s.t. \\H(n)\\ 2 < \\H(n + 1)|| 2 Vn E N (7) 

bg(log(ft)) fagWWr))) log(log(/l„)) log(log(S)) 

coon 

log(d) log(ii) log(d) log(c() 

Figure 2: Spherical counting scheme and approximations: The first picture shows the surface 
plot of the exact count h. White indicates the region in which computation of h was not 
feasible. In the second picture we can see the spherical volume. In the lower right region this 
approximation is not accurate. The third picture shows the Shannon coding length for a 
spherical data distribution, where we plot d vs. r/y/d. The reason why this approximation 
is not valid in the region underneath the first angle bisecting line is that r/y/d gets so 
small that the coding length would be estimated as one bit. The fourth picture shows the 
approximation function h. 

So, our spiral counting scheme H(n) guarantees that the 2-norm is monotonically increasing 
and therefore, since the universal description length of an integer is monotonically increasing 
with the integer, it follows that under this counting scheme the description length of an 
integer vector under H is monotonically increasing with its 2-norm. This leads us to state 
that: 



<h(r,N)<\U (max (V s N (r),l))\, (8) 

where Vg(r) denotes the volume of the hyper-sphere of radius r in R . We can establish 
a connection between h and the Shannon entropy h s h(x). It is the shortest average coding 
length for a random variable x of dimension N . For spherical data with variance c 2 , 



h sh (x) = y -log(27re) + iV-log( CT ) (9) 

To be able to compare h s h and h, we have to consider the case of finite data again: We 
cannot simply assume that a is known, but rather we have to actually encode it and consider 
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its coding length. We do so using the a-code. The minimum number of bits that we need 
to store a can be found by minimizing over the precision Act . We call the resulting applied 
Shannon coding length h ap - 



IV I — 

h a Jx) = — ■ log(27re) + min(W • log(cr + Act) + a(a, Act)) « h(r = aVN, N) (10) 

2 Act 

Figure [2] shows that the applied Shannon coding length is approximated by the spherical 
code length function for large N. Shannon- Fano or Huffman code lengths/entropies for a 
given distribution are only valid asymptotically, i.e. for very large N. For finite, small 
N, such coding - based on known or sample standard deviation (SD) - may even result in 
expansion rather than the intended compression. 

2.4 MDL for linear regression 

The complete (loss- less) code length approximation A m of the data to be minimized is: 



X M (9, S, X) = a(9, 8) + h(s 2 M (0#,X), N), (11) 

where = a (a(8, 8)). Bars on a and h denote smooth approximations. Note that 
this equation requires the function sm(-), which relates the size of the residual 2-norm 
to the parameters and their storage precision, for some model M which specifies how the 
parameters encode y, usually represented by a Hessian function. This principle can be 
applied to any type of regression fit: we shall do so for linear regression. 

Finding a close, smooth bound h for any r and N is a difficult mathematical problem, which 
also can be formulated as finding the number of representations of an integer as a sum of 
squares. A general (approximative) function has not been found yet |10j . For combinations 
of large enough r (> 5) and low enough TV (< 10 6 ) (see previous section) the following 
approximation based on the unit sphere volume is reasonable: 

The a code has a rather nice property in that the numerical value of the decoding a -1 (a(0j, Si)) = 
0i + 5(9i) is uniformly distributed on the interval [8i — Si, Qi + Si] with Pi(S) = (25i)~ 1 , over 
all possible pairs of 9 and 5: as Monte-Carlo simulations have confirmed, the code behaves 
without bias of {{0#)i — 9i)/5i. This allows us to take the expected value of the increase in 
the norm of the residual for random, independent perturbations 5{9i) of each 9i with the 
uniform probability Pi(5). Calculating the growth of residual norm (squared) with respect 
to parameter accuracy 8 is straightforward for linear regression: 

**(a-Ha(e,S)),X) = {y-X-a-\a(9,8))) T -(y-X-a-^aiO^))) 

= (y - X ■ (9 + 8(9))f ■ (y - X ■ (8 + 8(8))) (13) 

We choose 8 = (X T X) 1 X T y to minimize the 2-norm of the residual e T e, (or any exact 
solution in the undcrdctermined case) so the gradient is with respect to 8 zero, meaning 
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that any perturbation from results in strictly quadratic growth: e T e + 8(6) T X T X8(8) = 
e T e + 6(0) T J^xS(9), where Ex is the covariance of X. Taking the expected value over 
possible perturbations (i.e. the outputs of the coder-decoder): 



E [8(6^(^x^8(6,)] 

E [6(6^(^x^6(6,)]^ 
E [5(8) T Xx8(9)] 



P (S(6 t ))(^xk l S(6 t ) 2 dS(6 t ) 



(26 i )- 1 ^ x ) i , i St = l -&x)i,i5l 



E 



(14) 



Finally, a diffcrentiable CLR objective function for linear regression can be written. 



nn(E m [\ M (0,5,X)]) = min (a(d, 6) + E m [h(s 2 M + 8(X) T Z X , M 6(X),N)]) 



mm 
e 



£fi(Mi) + &(*L + 5£(E*k< 



(15) 



The approximation E[h(a + x)} = E[h(a) + h'(a)x+ ...} = h(a) + h'(a)E[x} + ... = h(a + E[x]) 
is valid if h(a) is locally linear - being 0(log(a)) that assumption is reasonable. The resulting 
objective function is smooth in optimization parameters (2K such parameters: 6 and 8) but 
is non-convex. A useful unbiased heuristic we have found was to use simplex optimization 
(code was written in Matlab, Mathworks. Inc.) with starting value for 8 at the usual 2- 
norm solution and Si = \&i\/2. After each downhill optimization, all parameters for which 
Si > were discarded and the entire procedure repeated (with reduced X) until no further 
parameters were thus '"culled"'. 



3 Results 

The CLR method proposed was compared to 2 different LASSO variants on both simulated 
and real data. Simulated data allows us to gauge the relative ability of CLR to accurately 
recover the sparsity structure in the case in which it is known. We produced the same 
synthetic data sets as did Tibshirani in his seminal LASSO paper [53] ■ For each of three 
examples, we simulated 50 data sets consisting of 20 observations from the model y = 6 T x + 
ere, where e is standard normal. The correlation between Xi and Xj was p' 1 ^' with p = 0.5. 
The dataset paramters for SIM1 were (3 = (3, 1.5, 0, 0, 2, 0, 0, 0) T and a = 3, for SIM2 
fa = 0.85, Vj and a = 3 and for SIM3 /3 = (5,0,0,0,0,0,0,0) and a = 2. In [23] LASSO 
performed favorably compared to other methods such as ridge regression and therefore only 
LASSO and normal 2-norm regression was performed for comparison. Performance results 
are given in Table 1. for each of the simulated datasets (SIM1, SIM2, SIM3) and methods 
evaluated (CLR - parsimonious linear regression, LASSOLCR - LASSO largest consistent 
region, LASSOCV - LASSO with cross-validation, similar to Tibshirani's cross-validated 
LASSO in [53] which was found to be highly performant). Both variants used bootstrapping 
for LASSO statistic estimation. We used the LASSO implementation available in the Matlab 
repository (www.mathworks.com/matlabcentral/fileexchange) written by M. Dunham and 
described in [4] . The given values are mean running time (as a fraction of L2 running time) 
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over all simulations, and for each dataset: the number of non-zero parameters computed, 
the minimum squared error (MSE) of the computed parameters vs. simulated parameters 
and the square root of the MSE of residuals as a fraction of the size of the noise used for the 
simulation a. The bias value is not included in these calculations. For L2 the full parameters 
set is always returned and therefore 8 vs. 2 means that 2 of the actual parameters for that 
dataset were non-zero. 



Table 1: Results of sparse linear regressions on simulated datascts. 



Method 




CLR 


LASSOLCR 


LASSOCV 


L2 


Time 




112 


204 


202 


1 


nr. of parameters found 


SIMl 
SIM2 
SIM3 


2.10±1.4 
1.26±1.1 
1.72±1.3 


0.32±0.5 
0.04±0.2 
0.90±0.3 


3.40±2.2 
4.08±1.7 
2.68±2.0 


8 vs. 3 
8 vs. 8 
8 vs. 1 


MSE 


SIMl 
SIM2 
SIM3 


1.27±0.8 
1.12±0.4 
0.19±0.4 


1.72±0.4 
0.74±0.1 
0.33±0.9 


1.01±0.6 
1.10±0.6 
0.25±0.4 


1.10±0.8 
1.04±0.8 
0.72±0.8 


Sd{e)/(T 


SIMl 
SIM2 
SIM3 


1.05±0.3 
1.28±0.4 
0.94±0.2 


1.68±0.4 
1.77±0.3 
1.11±0.4 


0.91±0.2 
0.91±0.2 
0.87±0.2 


0.73±0.1 
0.75±0.1 
0.75±0.2 



In order to test generalization performance, we used the following datasets from the UCI 
repository [3]: Automobile, Concrete Compressive Strength [53], Forest Fires [5], Housing, 
Qualitative Structure Activity Relationships (Pyrimidincs and Triazines). We also used 
one more dataset from the Statlib Archives (http://lib.stat.cmu.edu/): Detroit [14]. For 
generalization estimation, we used a training test split of 2/3 to 1/3. A feature product of 
Xfc+i+i = Df was used along with a bias term for all methods. Figure [3] shows training vs. 
test set errors on the 7 datasets evaluated. The axes are standard deviations of residuals as 
a fraction of the standard deviation of the target, for both test (x-axis) and training subsets 
(y-axis). Note that many analyses (especially LASSOLCR) resulted in no features (except 
bias) being selected, i.e. the ratio of s.d.'s equaled 1. The callout in the top right corner is 
a list of such instances. The actual test s.d. ratio value for LASSOCV, dataset 4, was 59.0 
(off scale). The solution sparsity, averaged over all datasets, was 16.7% for CLR, 11.7% for 
LASSOLCR and 37.4% for LASSOCV. 



4 Conclusion 

In this paper, we have outlined both the theoretical justification for a detailed MDL appli- 
cation to linear regression which concurrently provides both regularization, feature selection 
and model selection. At first glance, the complex derivation may appear to be a bit of an 
overkill. However, simpler MDL applications to the feature selection problem, which have 
been previously proposed (MDL as a model-selection is long in use for time-series linear 
auto-regressive modeling), do not perform as well to our experience. Why so? First of 
all because code length estimates based on Shannon entropy do not work well for small 
numbers of examples (hence the meticulous derivation of the spherical code) . The dataset 
results we present herein are precisely for small N problems, where there is not much to 
compress. In the case of auto-regressive time series modeling, it was previously shown that 
a simpler version of our method exhibits increasing sparsity structure recovery accuracy 
with increasing N. The same non-standard MDL/MML situation applies for large values of 
quantization width 5 with respect to standard deviation of the target. In fact, S can be (and 
was) inferred directly from the histogram of the target. Histograms of the modelling errors 
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Figure 3: Generalization performance on real data. The datasets are numbered as follows: 
1. Concrete (N=1030, J=8), 2. Forest Fires (N=517, J=12), 3. Detroit (N=60, J=15), 4. 
Housing (N=506, J=13), 5. Automobile (N=159, J=15), 6. Triazincs (N=186, J=31), 7. 
Pyrimidines (N=74, J=26). 

(residuals) on the "real" datasets showed that only one of these passed the Jarque-Berra 
test for normality (Detroit) yet CLR worked well for 2 other datasets. CLR did not perform 
well on 2 datasets, but neither did either LASSO variant. Note also that LASSO-CV highly 
overfit on one of the datasets, highlighting the overfit risk of cross-validation. Overall CLR 
was quite conservative, and it outperformed LASSO on the sparse simulated datasets (in 
avoiding overfit and recovering the structure). If there is any generally discernible trend 
in CLR, it is toward underfit, which is not suprising since we are restricting ourselves to a 
smaller class of models than that from which the 'generating' model may have belonged to. 

The CLR method is so conservative, in fact, that it also "works" in the underdetermined 
case, not only allowing us to use product features such as the square but also all combina- 
tions of feature product pairs, allowing for a full multi-dimensional polynomial fit. It may 
be surprising to note that CLR actually worked faster than the "convex" LASSO although 
it is not convex. This is because of the reliance on cross-validation and bootstrapping that 
hypcrparamcter choice traditionally requires, both in LASSO and other classification/regres- 
sion paradigms. As the number of features increases, the relative running time of gradient 
descent algorithms will increase relative to the LASSO, and further work will be necessary 
to provide for convex approximation to the MDL and customized, more efficient optimiza- 
tion algorithms. As it is known that feature selection is, per se, a non-convex, NP-complete 
problem [6j it is not likely that the computational time burden will decrease without addi- 
tional compromises. We'd like to assert, credibly, that there are no current ML approaches 
which arc fully convex, cither because of the hypcrparamcter choice or the adjustment of 
prior variances. However, our work points to the possibility of fully automatic learning, 
with no practitioner imposed choices necessary. We have already implemented heuristic al- 
gorithms which itcrativcly generate, and then cull/prune features to minimize MDL, which 
is an unbounded-time formulation. Since learning in general, at least under the algorithmic 
information theory perspective, is incomputable, it should not be surprising that heuristic 
approaches with practically imposed time bounds will be necessary. While the common 
Gaussian i.i.d assumption was used herein, the MDL formulation allows for learning under 
other, or even unkown, noise sources. The underlying philosophy of our ML approach is to 
spend computational resources on model generation and selection rather than validation. 
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